# Cross-modal Understanding
## InternVL3 78B HF
License: Other · Task: Image-to-Text · Tags: Transformers, Other · Org: OpenGVLab · Downloads: 40 · Likes: 1

InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
## Cephalo Gemma 3 4B IT (04-16-2025)
Task: Image-to-Text · Tags: Transformers · Org: lamm-mit · Downloads: 17 · Likes: 1

Cephalo-Gemma-3-4b is a vision-language model specialized in biomaterials and spider-silk analysis, fine-tuned from the Gemma architecture.
## Qwen2.5 Omni 7B
License: Other · Task: Multimodal Fusion · Tags: Transformers, English · Org: Qwen · Downloads: 206.20k · Likes: 1,522

Qwen2.5-Omni is an end-to-end multimodal model that perceives text, images, audio, and video, and generates text and natural speech responses in a streaming manner.
## Centurio Aya
Task: Image-to-Text · Tags: Transformers, Supports Multiple Languages · Org: WueNLP · Downloads: 29 · Likes: 4

Centurio is an open-source multilingual large vision-language model supporting 100 languages, capable of both image-to-text and text-to-text tasks.
## ThaiCapGen CLIP-GPT2
Task: Image-to-Text · Tags: Other · Org: Natthaphon · Downloads: 18 · Likes: 0

An encoder-decoder model that pairs a CLIP encoder with a GPT-2 decoder to generate Thai image descriptions.
## Chameleon 30B
License: Other · Task: Multimodal Fusion · Tags: Transformers · Org: facebook · Downloads: 102 · Likes: 86

Meta Chameleon is a mixed-modal early-fusion foundation model developed by FAIR, supporting joint processing of images and text.
## Final Model
License: Apache-2.0 · Task: Text Recognition · Tags: Transformers · Org: goatrider · Downloads: 17 · Likes: 0

An Apache-2.0-licensed image-to-text model that converts image content into textual descriptions.
## BLIP Image Captioning Large
License: BSD-3-Clause · Task: Image-to-Text · Tags: Transformers · Org: movementso · Downloads: 18 · Likes: 0

BLIP is a unified vision-language pretraining framework that excels at image caption generation and understanding, making efficient use of web data through a caption-bootstrapping strategy.
## General Image Captioning
License: Apache-2.0 · Task: Text Recognition · Tags: Transformers, Other · Org: alibidaran · Downloads: 30 · Likes: 0

An Apache-2.0-licensed image-to-text model that converts image content into textual descriptions.
## CLIP ViT-B/16 DataComp.XL S13B B90K
License: MIT · Task: Text-to-Image · Org: laion · Downloads: 4,461 · Likes: 7

A CLIP ViT-B/16 model trained with OpenCLIP on the DataComp-1B dataset, used primarily for zero-shot image classification and image-text retrieval.
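As a sketch of the zero-shot classification mechanism that CLIP-style models like this one rely on: the image and each candidate text prompt are embedded into a shared space, and the prompt with the highest cosine similarity wins. The toy 4-d vectors below are hypothetical stand-ins for the real encoders (which would require downloading the model and would produce e.g. 512-d embeddings).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is most similar to the image embedding.

    CLIP-style zero-shot classification: L2-normalize both sides, then
    cosine similarity reduces to a plain dot product.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # one cosine similarity per label
    return labels[int(np.argmax(sims))]

# Hypothetical toy embeddings, not outputs of a real model.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.0, 0.2, 0.0],
                      [0.0, 1.0, 0.0, 0.2]])
image_emb = np.array([0.9, 0.1, 0.3, 0.0])   # closest to the "cat" prompt

print(zero_shot_classify(image_emb, text_embs, labels))  # a photo of a cat
```

The same dot-product ranking also drives image-text retrieval: instead of picking one label per image, similarities are sorted across a whole gallery.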
## Pix2Struct DocVQA Base
License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Supports Multiple Languages · Org: google · Downloads: 8,601 · Likes: 37

Pix2Struct is an image-encoder/text-decoder model trained on image-text pairs, supporting tasks such as image captioning and visual question answering.
## MSCOCO-Finetuned CoCa ViT-L/14 Laion2B S13B B90K
License: MIT · Task: Image-to-Text · Org: laion · Downloads: 21.02k · Likes: 20

An MIT-licensed image-to-text model that converts image content into textual descriptions.
## VinVL Base Image Captioning
License: Apache-2.0 · Task: Image-to-Text · Org: michelecafagna26 · Downloads: 45 · Likes: 1

Microsoft's VinVL base pre-trained model, designed specifically for image captioning tasks, with strong vision-language understanding capabilities.
## Chinese CLIP ViT-Large Patch14 336px
Task: Text-to-Image · Tags: Transformers · Org: OFA-Sys · Downloads: 713 · Likes: 23

Chinese CLIP is a simple implementation of CLIP trained on roughly 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
## MolT5 Base
License: Apache-2.0 · Task: Machine Translation · Tags: Transformers · Org: laituan245 · Downloads: 3,617 · Likes: 1

molt5-base is a T5-based model designed specifically for translation between molecules and natural language.